Let’s start by looking at breast cancer data, by census tract, in western Washington state. We’ll use the data to make three plots:
It’s hard to see exactly what’s going on in the smaller census tracts, so let’s try zooming in.
Both plots indicate that the incidence rate of breast cancer is somewhere between 0.5% and 2.5%, though we do see one outlier with a rate of about 4%. Additionally, there doesn’t seem to be any clear pattern distinguishing rural areas from the more metropolitan areas around Seattle and Tacoma.
And another quick zoom to make sure we don’t lose the little guys.
Interestingly, the image shows that most of the wealthiest areas are just outside of the cities, not actually in the cities themselves.
We will make our metric the product of the cancer incidence rate and the median household income. Higher values will correspond to census tracts in which the cancer rate, the median income, or both are higher than usual.
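To make the metric concrete, here is a minimal Python sketch (the original analysis appears to have been done in R, and its code is not shown). The tract names, rates, and incomes below are made-up values for illustration only, and `combined_metric` is a hypothetical helper name.

```python
# Hypothetical census tracts: (name, cancer incidence rate, median household income in USD).
# These numbers are invented for illustration; they are not from the data set.
tracts = [
    ("tract_a", 0.010, 50_000),
    ("tract_b", 0.020, 50_000),   # higher rate, same income as tract_a
    ("tract_c", 0.010, 100_000),  # same rate, higher income than tract_a
    ("tract_d", None, 80_000),    # missing rate: no metric can be computed
]


def combined_metric(rate, income):
    """Product of incidence rate and median income, or None if either is missing."""
    if rate is None or income is None:
        return None
    return rate * income


metrics = {name: combined_metric(rate, income) for name, rate, income in tracts}

for name, value in metrics.items():
    print(name, value)
```

Note that `tract_b` and `tract_c` receive the same score: the product alone cannot tell a tract with a high rate from one with a high income, which is worth keeping in mind when reading the map.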
Our image shows brighter spots in the wealthier areas, suggesting that wealthier communities tend to have higher rates of breast cancer.
It’s also worth taking a look at a plot of the two variables.
## Warning: Removed 711 rows containing non-finite values (stat_boxplot).
There does appear to be a slightly increasing trend, indicating that as the median household income increases, so does the incidence rate of breast cancer. We can check whether the mean rate really differs across the income quantiles with a one-way ANOVA, using the following hypotheses:
\(H_0\): The mean cancer rate is the same in every income quantile.
\(H_A\): The mean cancer rate differs in at least one income quantile.
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(income_quantile) 4 0.000203 5.087e-05 6.971 1.59e-05 ***
## Residuals 881 0.006430 7.300e-06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 711 observations deleted due to missingness
With a p-value of about 1.6e-05, well below 0.05, we reject \(H_0\) in favor of \(H_A\). This means our data suggests a statistically significant difference in the mean cancer rate between income quantiles, even though the differences amount to just a few fractions of a percent.
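To show where the F value in the table comes from, here is a small standard-library Python sketch of the one-way ANOVA computation (the table itself looks like R output, and the original code is not shown). The three toy groups are invented for illustration, and `one_way_anova_f` is a hypothetical helper name.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy data (not from the cancer data set): three groups of three values each.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_toy, df_b, df_w = one_way_anova_f(toy_groups)
print(f_toy, df_b, df_w)

# The same ratio using the mean squares reported in the table above:
# F = Mean Sq (income quantile) / Mean Sq (residuals).
f_from_table = 5.087e-05 / 7.300e-06
print(round(f_from_table, 2))  # about 6.97, matching the reported F up to rounding
```

The 4 degrees of freedom for the factor correspond to five income groups, and 881 residual degrees of freedom mean that 886 tracts were used after the rows with missing values were dropped.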
We’ll now turn our attention to the 2000 U.S. presidential election, featuring Republican candidate (and later, president) George W. Bush and Democratic candidate Al Gore. The data is from the School of Public Affairs at American University in DC, and keeps track of the number of votes per candidate by county. (Note: The county map is from 2010, whereas the election data is from 2000, so a few counties will not match up, but the number of mismatches is small.)
Using this data, we’ll make maps showing the proportion of voters who voted for Bush at the state level and at the county level. Red will indicate the regions that favor Bush, blue the regions that don’t favor him, and white the regions that remained in the middle.
With our data set, we will consider:
We’ll start off at the state level, where we see quite a bit of red in the middle and light blue on the coasts.
Now we turn to the county level, to get a better sense of individual regions’ preferences.
Now with the maps on the table, we can take a stab at answering some of the questions.
Right off the bat, the county map shows a lot of red! It’s difficult to get a sense for exactly how “blue” or “red” a state is, since many of the counties with the largest area (which are also very sparsely populated) are red. The dark blue counties that rarely show up carry a lot more weight than their size implies, since they represent much more densely populated metropolitan areas.
We’ll take the example of California. If we look at the county map, most of California is covered in red or pink except for the LA and SF regions. But looking at the state map, we see that overall the state voted Democratic.
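The gap between the county picture and the state picture comes down to weighting: each county fills the map in proportion to its area, but it counts toward the state result in proportion to its votes. A minimal Python sketch of that effect, using invented vote counts (not the actual California returns) and hypothetical county names:

```python
# Hypothetical counties in one state: (name, votes for Bush, total votes cast).
# The numbers are made up to illustrate the weighting effect only.
counties = [
    ("rural_1", 6_000, 10_000),
    ("rural_2", 7_000, 10_000),
    ("rural_3", 6_500, 10_000),
    ("metro", 300_000, 1_000_000),
]

# Share of the vote for Bush in each county (what the county map colors).
county_shares = [bush / total for _, bush, total in counties]

# Unweighted average of county shares: every county counts equally,
# which is roughly the impression a map full of large rural counties gives.
mean_county_share = sum(county_shares) / len(county_shares)

# State-level share: pool the votes first, so populous counties dominate.
state_share = sum(bush for _, bush, _ in counties) / sum(
    total for _, _, total in counties
)

print(round(mean_county_share, 3))
print(round(state_share, 3))
```

With these toy numbers, three of the four counties favor Bush and the average county share is above one half, yet the pooled state share is well below one half because the single metropolitan county casts most of the votes.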
(Image credit to “Ninenations” by A Max J.)
The county map seems to fit well with the Nine Nations of North America idea. Starting on the NW coast, we see a lot more blue along the coast, corresponding to ‘Ecotopia’. The ‘Empty Quarter’ is, for the most part, made up of large Republican counties. Further south in ‘Mexamerica’, the counties start gaining some blue. This Democratic skew is also seen in the “nation” of ‘New England’. We have to look at the state map to spot the difference between the lightly blue ‘Foundry’ and the red ‘Dixie’. Finally, it’s hard to get a sense for the ‘Breadbasket’, since the political orientation doesn’t seem as consistent as in the other “nations”.
We can get a sense for the most conflicted states by looking at the standard deviation of the voting percentage across the counties in each state.
The plot above shows the ten states with the highest standard deviation in the county-level percentage voting Republican vs. Democratic. In other words, the plot shows the states that seem to be most ‘divided’ in terms of their political beliefs. We see that many of these states have both at least one very big city and huge regions that are mostly unpopulated.
If we look at the states on the above map, we confirm that states like New Mexico, Oregon, Texas, California, Alabama, etc. all seem to share these features.
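The ranking by standard deviation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the state names and county shares are invented, and the sample standard deviation (`statistics.stdev`) is assumed, since the original code is not shown.

```python
from statistics import stdev

# Hypothetical county-level shares of the vote for Bush, grouped by state.
# These values are made up; they are not from the election data.
county_shares = {
    "state_x": [0.30, 0.70, 0.75, 0.25],  # big gap between city and rural counties
    "state_y": [0.55, 0.60, 0.58, 0.57],  # counties vote alike
    "state_z": [0.45, 0.65, 0.50],
}

# Sample standard deviation of the county shares within each state.
spread = {state: stdev(shares) for state, shares in county_shares.items()}

# States ordered from most to least internally divided.
most_divided = sorted(spread, key=spread.get, reverse=True)

for state in most_divided:
    print(state, round(spread[state], 3))
```

Each county counts equally here regardless of population, so a state ranks high whenever its counties disagree with one another, which fits the pattern of one very large city surrounded by sparsely populated regions.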